Appendix of Synergy-of-Experts: 1 Theoretical Proofs
From Figure 1(a), learning multiple linear sub-models and averaging their predictions (ensemble) is still a linear model, so it cannot tackle the XOR problem. We compare the training cost of all methods from two aspects: 1) the sub-model training ensures that most adversarial attacks on the sub-models can be successfully defended. In particular, we train two kinds of models to defend against the attacks. From Figures 2(a) and 2(b), when 0.01 ≤ ε ≤ 0.04, SoE without the collaboration training achieves robustness similar to that of SoE.
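To make the averaging argument concrete, here is a minimal NumPy sketch (not code from the Synergy-of-Experts paper) showing that the average of any set of linear sub-models collapses to a single linear model with the averaged parameters, and such a model cannot separate the XOR labels.

```python
import numpy as np

# XOR inputs and labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Two arbitrary linear sub-models f_i(x) = w_i @ x + b_i
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))   # one weight row per sub-model
b = rng.normal(size=2)

# Ensemble by averaging the sub-model predictions
avg_pred = (X @ W.T + b).mean(axis=1)

# The average is itself a single linear model with averaged parameters
w_avg, b_avg = W.mean(axis=0), b.mean()
assert np.allclose(avg_pred, X @ w_avg + b_avg)

# No linear decision rule labels XOR correctly: for any linear f,
# f(0,0) + f(1,1) == f(0,1) + f(1,0), so the two classes cannot both
# lie on opposite sides of a single threshold.
print((avg_pred > avg_pred.mean()).astype(int), "vs labels", y)
```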
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > Texas > Travis County > Austin (0.05)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.05)
FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training
Recent breakthroughs in deep neural networks (DNNs) have motivated an explosive demand for intelligent edge devices. Many of them, such as autonomous vehicles and healthcare wearables, require real-time and on-site learning to enable them to proactively learn from new data and adapt to dynamic environments.
D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models
Continual Pre-Training (CPT) on Large Language Models (LLMs) has been widely used to expand the model's fundamental understanding of specific downstream domains (e.g., math and code). For CPT on domain-specific LLMs, one important question is how to choose the optimal mixture ratio between the general corpus (e.g., Dolma, Slim-pajama) and the downstream domain corpus. Existing methods usually rely on laborious human effort, grid-searching over a set of mixture ratios, which incurs high GPU training costs. Besides, there is no guarantee that the selected ratio is optimal for the specific domain. To address these limitations, inspired by the Scaling Law for performance prediction, we propose to investigate the Scaling Law of Domain-specific Continual Pre-Training (D-CPT Law) to decide the optimal mixture ratio with acceptable training costs for LLMs of different sizes. Specifically, by fitting the D-CPT Law, we can easily predict the general and downstream performance of arbitrary mixture ratios, model sizes, and dataset sizes using small-scale training runs on limited experiments. Moreover, we also extend our standard D-CPT Law to cross-domain settings and propose the Cross-Domain D-CPT Law to predict the D-CPT Law of target domains, where very small training costs (about 1% of the normal training costs) are needed for the target domains. Comprehensive experimental results on six downstream domains demonstrate the effectiveness and generalizability of our proposed D-CPT Law and Cross-Domain D-CPT Law.
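The abstract does not give the exact parameterization of the D-CPT Law, so the sketch below only illustrates the fit-then-predict workflow it describes, assuming a hypothetical Chinchilla-style loss form with an added mixture-ratio term; the function name, the functional form, and all data points are placeholders, not the paper's.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical parametric form (NOT the paper's exact D-CPT Law): a
# Chinchilla-style term in model size N and data size D plus a
# mixture-ratio term in r, purely to illustrate fitting and prediction.
def domain_loss(X, E, A, alpha, B, beta, C, gamma):
    N, D, r = X   # N: params (billions), D: tokens (billions), r: domain-corpus ratio
    return E + A / N**alpha + B / D**beta + C * (1.0 - r)**gamma

# Small-scale pilot runs (synthetic placeholder numbers, not paper results).
N = np.array([0.1, 0.1, 0.1, 1.0, 1.0, 1.0, 7.0, 7.0])
D = np.array([1.0, 5.0, 20.0, 1.0, 5.0, 20.0, 5.0, 20.0])
r = np.array([0.1, 0.3, 0.5, 0.1, 0.3, 0.5, 0.3, 0.5])
loss = np.array([3.9, 3.5, 3.2, 3.4, 3.0, 2.7, 2.6, 2.3])

popt, _ = curve_fit(domain_loss, (N, D, r), loss,
                    p0=[2.0, 0.5, 0.3, 0.5, 0.3, 1.0, 1.0], maxfev=50000)

# Once fitted, predicting downstream loss at an unseen mixture ratio is a
# function evaluation, so the ratio for a target model/data budget can be
# chosen by a cheap grid search instead of retraining at every candidate.
ratios = np.linspace(0.05, 0.95, 19)
pred = domain_loss((np.full_like(ratios, 13.0), np.full_like(ratios, 100.0), ratios), *popt)
print("best ratio under this toy fit:", ratios[np.argmin(pred)])
```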
Layer Freezing & Data Sieving: Missing Pieces of a Generic Framework for Sparse Training
Recently, sparse training has emerged as a promising paradigm for efficient deep learning on edge devices. Current research mainly devotes its efforts to reducing training costs by further increasing model sparsity. However, increasing sparsity is not always ideal, since it inevitably introduces severe accuracy degradation at extremely high sparsity levels. This paper explores other possible directions to effectively and efficiently reduce sparse training costs while preserving accuracy. To this end, we investigate two techniques, namely, layer freezing and data sieving. First, the layer freezing approach has shown its success in dense model training and fine-tuning, yet it has never been adopted in the sparse training domain.
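As a concrete illustration of the layer-freezing idea (a sketch of common practice, not the framework proposed in this paper), the following PyTorch snippet disables gradient updates for the earliest layers partway through training, so their backward passes and weight updates are skipped; the model, schedule, and function name are illustrative assumptions.

```python
import torch.nn as nn

def freeze_early_layers(model: nn.Sequential, num_frozen: int) -> None:
    """Disable gradients for the first `num_frozen` layers so their weights
    (and any sparsity masks on them) stop being updated, cutting backward cost."""
    for layer in list(model.children())[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False
        layer.eval()  # also freeze batch-norm running statistics, if any

# Toy model and a simple progressive-freezing schedule (illustrative only).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(10),
)
for epoch in range(30):
    if epoch == 10:
        freeze_early_layers(model, 2)   # freeze the first conv + ReLU
    if epoch == 20:
        freeze_early_layers(model, 4)   # freeze both conv blocks
    # ... run the usual (sparse) training step on the remaining layers ...
```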